ABSTRACT

Context. Morphology is the most accessible tracer of the physical structure of galaxies, but its interpretation in the framework of galaxy evolution
remains a problem. Its dependence on wavelength makes the comparison between local and high-redshift populations difficult.
Furthermore, since the quality of the measured morphology depends strongly on the image resolution, the comparison between different
surveys is also a problem.

Aims. We present a new non-parametric method to quantify the morphologies of galaxies, based on a particular family of learning machines called
support vector machines. The method, which can be seen as a generalization of the classical CAS classification but with an unlimited number of
dimensions and non-linear boundaries between decision regions, is fully automated and thus particularly well adapted to large cosmological
surveys. The source code is available for download at http://www.lesia.obspm.fr/~huertas/galsvm.html

Methods. To test the method, we use a seeing-limited near-infrared (Ks band, 2.16 μm) sample observed with WIRCam at CFHT at a median
redshift of z ~ 0.8. The machine is trained with a simulated sample built from a local, visually classified sample from the SDSS chosen in
the rest frame of the high-redshift sample (i band, 0.77 μm) and artificially redshifted to match the observing conditions. We use a 12-dimensional
volume, including 5 morphological parameters and other characteristics of galaxies such as luminosity and redshift. A fraction of the simulated
sample is used to test the machine and assess its accuracy.

Results. We show that a qualitative separation into two main morphological types (late type and early type) can be obtained with an error lower
than 20% up to the completeness limit of the sample (K_AB ~ 22), which is more than 2 times better than what would be obtained with a classical
C/A classification on the same sample, and indeed comparable to space data. The method is optimized to solve a specific problem, offering
an objective and automated estimate of errors that enables a straightforward comparison with other surveys. Selecting the training sample in
the rest frame of the high-redshift sample makes the results free from wavelength-dependent effects and hence makes their interpretation in terms
of evolution easier.

Conclusions. 

Key words. galaxies: fundamental parameters - galaxies: high redshift

1. Introduction

The process of galaxy formation and the way galaxies evolve
are still among the key unresolved problems in modern astrophysics.
In the currently accepted hierarchical picture of
structure formation, galaxies are thought to be embedded in
massive dark halos that grow from density fluctuations in the
early universe (Fall & Efstathiou 1980) and initially contain
baryons in a hot gaseous phase. This gas subsequently cools,
and some fraction eventually condenses into stars (Lilly et al.
1996; Madau et al. 1998). However, many of the physical details
remain uncertain, in particular the process and history of
mass assembly. One classical observational way to test those
models is to classify galaxies according to morphological criteria
defined in the nearby Universe (Hubble 1936; de Vaucouleurs
1948; Sandage 1961), i.e., the organization of their brightness as
projected on the plane of the sky and observed at a particular
wavelength, and to follow this classification across
time (Abraham et al. 1996; Simard et al. 2002; Abraham et al.
2003). Comparison of distant populations with the ones found
in the nearby Universe might help to clarify the formation
history of galaxies (Cole et al. 2000; Baugh et al. 1996).

* Based on observations obtained at the Canada-France-Hawaii
Telescope (CFHT) which is operated by the National Research
Council of Canada, the Institut National des Sciences de l'Univers
of the Centre National de la Recherche Scientifique of France, and the
University of Hawaii.



Progress in this field has come from observing deeper and
larger samples, but also from obtaining higher spatial resolution
at a given flux and at a given redshift. In the visible,
progress has been simultaneous on those two fronts, thanks
in particular to the ultra-deep HDF fields observed with the
Hubble Space Telescope. HST imaging has brought observational
evidence that galaxy evolution is differentiated with respect
to morphological type and that a large fraction of distant
galaxies have peculiar morphologies that do not fit into
the elliptical-spiral Hubble sequence (Brinchmann et al. 1998;
Wolf et al.; Ilbert et al. 2006b).



However, a major obstacle is still the difficulty of quantifying
the morphology of high-redshift objects with a few simple,
reliable measurements. Indeed, with the increasing number
of cosmological surveys available today, classical visual
classifications become impractical and automated methods must
be employed. Broadly, there are two main approaches. The
first one, known as parametric, consists in modeling the distribution
of light with an analytic model and fitting it to the real
galaxy. A commonly used parameter in this approach is the
bulge-to-disk (B/D) light ratio, which correlates with qualitative
Hubble type classifications and can be obtained by fitting a
two-component profile (Simard et al. 2002; Peng et al. 2002).
The main advantage of such a method is that the fitting output
provides a quantitative morphology, i.e. a complete set of
parameters that describe the galaxy's shape (disk scale length,
bulge effective radius...). Results are, unfortunately, often degenerate
because of the high number of parameters to be adjusted
(Huertas-Company et al. 2007), even when the residuals
are almost null, and the accuracy obtained depends strongly on
the observing conditions (angular resolution, S/N...). Moreover,
this approach assumes that the galaxy is well described by a
simple, symmetric profile, which is not true for irregular or
well-resolved objects.

The second approach is called non-parametric and basically
consists in measuring a set of well-chosen parameters
that correlate with the Hubble type. The main advantage
of this method is that it does not assume a particular
analytic model and can therefore be used to classify regular
as well as irregular galaxies. The resulting morphology
is, however, more qualitative. Abraham et al. (1994) and
Abraham et al. (1996) first proposed this method by defining
the concentration and asymmetry (C and A) parameters. They
showed that plotting those values in a 2D plane results in
a fairly good separation between the three main morphological
types (early type, late type and irregulars). Subsequent
authors then modified the original definitions to make C and
A more robust to surface-brightness selection, centering errors
or redshift dependence (Brinchmann et al. 1998; Wu 1999;
Bershady et al. 2000; Conselice et al. 2000) and introduced
new parameters. In particular, a third parameter, the smoothness
(S), was proposed by Conselice et al. (2003) and gave its
name to the CAS morphological classification system. More
recently, Abraham et al. (2003) and Lotz et al. (2004) proposed
two new parameters: the Gini coefficient, which correlates with
concentration, and the M20 moment. Each of those parameters
brings a different amount of information concerning the galaxy
shape. There is no way, however, with classical approaches
to use more than 3 parameters simultaneously. Bershady et al.
(2000) made a first attempt at a multi-parameter analysis
on a nearby sample using a 4-dimensional space including
concentration and asymmetry as well as luminosity
and color information. They indeed found correlations between
those parameters and defined six 2D planes resulting
from the combinations of those parameters. The classification
was, however, done independently in each plane without
considering all the information simultaneously. In the framework
of the COSMOS consortium (Scoville & COSMOS Team
2005), Scarlata et al. (2006) have recently made a step forward
by proposing a multi-parameter classification scheme (ZEST)
based on the positions of galaxies in a three-dimensional space
resulting from a principal component analysis on a 5-dimensional
space. The method uses almost all the information contained
in the 5 parameters, but the final calibration is done in 3
dimensions.

Indeed, one key point in this kind of analysis is to correctly calibrate
the volume, i.e. to correctly estimate the decision regions.
One approach is to use boundaries defined in the nearby universe
from a visually classified sample and to assume that they remain
unchanged for a sample at high redshift, observed at a different
wavelength and with another instrument (Abraham et al.
1996). However, it is well known that galaxy morphology
depends on wavelength (K-correction) and on the observing
conditions, which is why some corrections should be applied
to take these effects into account (Brinchmann et al. 1998).
Another approach consists in visually classifying a fraction of
the sample and plotting the boundaries according to the positions
of galaxies in the space (Menanteau et al. 2006; Scarlata et al.
2006). This of course takes into account the observing conditions
of the sample, but requires enough resolution and S/N
to decide the galaxy morphology visually. This is
possible for space observations but becomes more difficult for
ground-based observations, where the low resolution does not
allow a reliable visual classification. In all these approaches,
boundaries are forced to be linear (2D lines or hyperplanes)
and are generally plotted manually, which introduces a subjective
element that makes a correct estimate of errors more difficult.

In this paper, we propose a generalization of the non-parametric
classification that uses an unlimited number of dimensions
and non-linear separators, making it possible to use simultaneously
all the information brought by the different morphological
parameters. The approach uses a particular class of learning
machines (called support vector machines) that finds the optimal
decision regions in a volume using a training set. Here, we
build this training set from a local sample that is transformed to
reproduce the physical and instrumental properties of the science
sample, so that it can be used even on seeing-limited observations.
The algorithm defines, in an automated way, the optimal
decision regions using multi-dimensional hyper-surfaces
as boundaries. It therefore allows a straightforward comparison
between different science samples. The classification scheme
that we propose is intended as a framework for future studies
on large cosmological fields.

The paper proceeds as follows: generalities on pattern
recognition, and in particular on support vector machines
(SVM), are described in the next section. In Section 3 we make
sure that SVM work properly when applied to a nearby sample.
In Section 4 we describe the general steps of the proposed
method to classify high-redshift objects. We show, in particular,
how the training set is built to reproduce the properties of the
real sample (§4.2), we define the parameters measured for the
morphological classification (§4.3), and we finally describe several
tests performed to probe the accuracy of the method (§4.4).

We use the following cosmological parameters throughout
the paper: $H_0 = 70$ km s$^{-1}$ Mpc$^{-1}$ and
$(\Omega_m, \Omega_\Lambda, \Omega_k) = (0.3, 0.7, 0.0)$.

2. Generalities on pattern recognition 

Suppose a set of observations of a given phenomenon, in which
each observation consists of a vector $x_i \in \mathbb{R}^n$, $i = 1, \dots, l$, and
of an associated "truth" $y_i$. For instance, in a classical concentration
and asymmetry classification plane, $x_i$ would be a 2D
vector whose components are the concentration and the asymmetry,
and $y_i$ would be 0 if the galaxy is irregular, 1 if it is disk
dominated and 2 if it is bulge dominated. We then call learning
machine a machine whose task is to learn the mapping $x_i \mapsto y_i$,
chosen from a set of possible mappings $x \mapsto f(x, \alpha)$. A particular
choice of $\alpha$ generates what is called a "trained machine".

2.1. Support vector machines 

Support vector machines are a particular family of learning machines,
first introduced by Vapnik (1995) as an alternative to
neural networks, that have been successfully employed to
solve classification problems, especially in biological applications.

In order to simplify the description of the most important
points concerning SVM, we focus without loss of generality on a
two-class classification problem: $\{x_i, y_i\}$, $i = 1, \dots, l$,
$y_i \in \{-1, 1\}$, $x_i \in \mathbb{R}^d$. The basic idea is to find a hyperplane that
separates the positive from the negative examples. If this plane
exists, the points $x$ that lie on the hyperplane satisfy $w \cdot x + b = 0$,
with $w$ normal to the hyperplane and $|b|/\|w\|$ the perpendicular
distance from the hyperplane to the origin. $d_+$ ($d_-$) is then
the shortest distance from the separating hyperplane to the
closest positive (negative) example. The "margin" is defined to
be $d_+ + d_-$. The algorithm then simply looks for the separating
hyperplane with the largest margin, subject to the following
constraints:

1. $x_i \cdot w + b \geq +1$ for $y_i = +1$

2. $x_i \cdot w + b \leq -1$ for $y_i = -1$

The training points for which the equalities hold and whose
removal would change the solution are called support vectors
(Figure 1).
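
As a minimal sketch of the two constraints above, the following pure-Python fragment checks whether a candidate hyperplane (w, b) satisfies them on a labelled set and computes the margin d+ + d-. The function names and the four toy points are illustrative assumptions; they are not taken from the paper or from libSVM.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def satisfies_constraints(points, labels, w, b):
    """True if y_i (x_i . w + b) >= 1 holds for every training point."""
    return all(y * (dot(x, w) + b) >= 1 for x, y in zip(points, labels))

def margin(points, labels, w, b):
    """d+ + d-: distances from the hyperplane to the closest
    positive and the closest negative example."""
    norm = math.sqrt(dot(w, w))
    d_plus = min(abs(dot(x, w) + b) / norm
                 for x, y in zip(points, labels) if y == +1)
    d_minus = min(abs(dot(x, w) + b) / norm
                  for x, y in zip(points, labels) if y == -1)
    return d_plus + d_minus

# Toy data (hypothetical): two positive and two negative examples.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [+1, +1, -1, -1]
w, b = (0.5, 0.5), -1.0

ok = satisfies_constraints(points, labels, w, b)
m = margin(points, labels, w, b)
```

In this toy case the points (2, 2) and (0, 0) lie exactly on the two constraint planes, so they play the role of support vectors and the margin equals 2/||w||.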



It can happen (and it is the most common case) that no
linear hyperplane perfectly separates the two data
sets. In this case we can relax the constraints by introducing positive
slack variables $\xi_i$, $i = 1, \dots, l$, and the constraints then become:

1. $x_i \cdot w + b \geq +1 - \xi_i$ for $y_i = +1$

2. $x_i \cdot w + b \leq -1 + \xi_i$ for $y_i = -1$

The global effect is to change the objective function to be minimized
to $\|w\|^2/2 + C \left(\sum_i \xi_i\right)$, where $C$ is a parameter to be chosen
by the user, a larger $C$ corresponding to a higher
penalty on errors.
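
To make the role of C concrete, here is a small sketch that evaluates the relaxed objective for a fixed hyperplane. It assumes each slack variable takes the smallest value allowed by the relaxed constraints, i.e. max(0, 1 - y_i (x_i . w + b)); the function names and the toy data are illustrative only.

```python
def slack(points, labels, w, b):
    """Smallest slack xi_i compatible with the relaxed constraints:
    zero for points on the correct side of their constraint plane."""
    out = []
    for x, y in zip(points, labels):
        score = sum(a * c for a, c in zip(x, w)) + b
        out.append(max(0.0, 1.0 - y * score))
    return out

def soft_margin_objective(points, labels, w, b, C):
    """||w||^2 / 2 + C * sum(xi_i) for a given hyperplane (w, b)."""
    xi = slack(points, labels, w, b)
    return 0.5 * sum(c * c for c in w) + C * sum(xi)

# Toy data (hypothetical): the last positive example violates the margin.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0), (0.5, 0.5)]
labels = [+1, +1, -1, -1, +1]
w, b = (0.5, 0.5), -1.0

xi = slack(points, labels, w, b)
low_penalty = soft_margin_objective(points, labels, w, b, C=1.0)
high_penalty = soft_margin_objective(points, labels, w, b, C=10.0)
```

Only the offending point gets a non-zero slack, and raising C from 1 to 10 multiplies its contribution to the objective by ten, which is the sense in which a larger C tolerates fewer errors.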

Another feature that can be added to solve more complex
problems is a non-linear decision function. To do so, we map
the data to some other (possibly infinite-dimensional) Euclidean
space $H$, $\Phi: \mathbb{R}^d \mapsto H$, where the data can be linearly
separated by some hyperplane. Since the data appear in the
training problem only in the form of dot products
$x_i \cdot x_j$, the training algorithm depends on
the data only through dot products in $H$, i.e. on functions of the
form $\Phi(x_i) \cdot \Phi(x_j)$. If there is a "kernel function" $K$ such that
$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, we never need to know
explicitly what $\Phi$ is. Examples of kernels are $K(x, y) =
(x \cdot y + 1)^p$ (polynomial) and $K(x, y) = e^{-\|x - y\|^2 / 2\sigma^2}$ (Gaussian RBF).
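
The kernel idea can be checked numerically on a small case. The sketch below implements the polynomial kernel and a Gaussian RBF kernel (assuming the usual form exp(-|x - y|^2 / 2 sigma^2)), and compares the p = 2 polynomial kernel in two dimensions with the dot product of an explicit six-dimensional feature map. The map `phi_poly2` and the sample vectors are illustrative choices, not part of the paper.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(x, y, p):
    """K(x, y) = (x . y + 1)^p"""
    return (dot(x, y) + 1.0) ** p

def gaussian_rbf_kernel(x, y, sigma):
    """K(x, y) = exp(-|x - y|^2 / (2 sigma^2)) (assumed normalisation)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def phi_poly2(x):
    """Explicit feature map whose dot product reproduces the
    p = 2 polynomial kernel for 2D inputs."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

u, v = (0.3, -1.2), (2.0, 0.5)
implicit = polynomial_kernel(u, v, p=2)       # computed in the input space
explicit = dot(phi_poly2(u), phi_poly2(v))    # computed in the mapped space
```

Both routes give the same number, but the kernel never builds the mapped vectors; for the Gaussian RBF kernel the mapped space is infinite-dimensional, so only the implicit route is available.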

In summary, SVM are a particular family of learning ma- 
chines that: 

- for linearly separable data, simply look for the optimal sep- 
arating hyperplane between distributions by maximizing 
the margin. 

- for non separable data a "tolerance" parameter C must be 
added which controls the tolerance to errors. 

- for non linear non separable data a kernel function is built 
that maps the space into a higher dimensional space where 
the data are linearly separable. Then the Kernel parameters 
must be adjusted too. 



2.2. Application to galaxies 

Abraham et al. (1994) proposed the idea of measuring some
well-chosen parameters on a galaxy image that can be easily
correlated with its morphology. In their paper they introduced
the concentration, which basically measures the fraction
of light contained in an inner isophote, and the asymmetry,
which measures the degree of symmetry of the galaxy. They
showed that plotting those values in a 2D plane results in a
fairly good separation between the three main morphological
populations: early-type, late-type and irregulars. They consequently
plotted linear separators to define the regions and classified
a set of galaxies with unknown morphology according
to their positions in the so-called C/A plane. In other words,
they tried to maximize the margins between 3 populations in a
2-dimensional space using linear separators. The same task can
be done in a 3-dimensional space (CAS, Conselice et al. 2003),
but it becomes simply impossible with more than 3 dimensions.
In this sense SVM offer a straightforward generalization of this
method, since they can separate samples with an unlimited number
of dimensions and use non-linear boundaries.

Fig. 1. 2D illustration of the three cases of SVM classification.
Top left: linearly separable data with linear boundaries. Top
right: non-linearly non-separable data with linear boundaries.
Bottom: non-linearly non-separable data with non-linear borders.



Previous works (Abraham et al. 1996; Brinchmann et al.
1998) have shown that morphological classification is far from
being a linearly separable problem, since the contamination in
the C/A plane is quite high. We have therefore chosen to use the
most general SVM, i.e. a non-linear machine with an RBF kernel.
A machine is thus characterized by two parameters: the
tolerance ($T$, the parameter $C$ introduced above) and the kernel
exponential factor ($g$). Each possible combination of those two
parameters generates a family of functions $f_{T,g}(\alpha, x_i)$. $T$ and $g$
must be fixed before performing the training, and $\alpha$ is the result
of the training procedure. There exist several techniques for
finding the best $T$ and $g$ values for a given problem; here we use the
cross-validation method described in Chang & Lin (2001). It simply
consists in performing a systematic search over a grid of possible values
and selecting the pair that gives the best results.
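
The grid search with cross-validation can be sketched as follows. To stay self-contained, the sketch does not call libSVM: the learner is a small kernel perceptron whose weights are capped at T, used here only as a stand-in for an SVM with tolerance T and RBF factor g. The grids, the toy data and every function name are assumptions made for illustration.

```python
import math
import random

def rbf(x, z, g):
    """Gaussian kernel written with an exponential factor g."""
    return math.exp(-g * sum((a - b) ** 2 for a, b in zip(x, z)))

def train_stand_in(X, y, T, g, epochs=20):
    """Kernel perceptron with weights capped at T (stand-in, NOT libSVM)."""
    alpha = [0.0] * len(X)

    def decision(x):
        return sum(a * yi * rbf(xi, x, g) for a, xi, yi in zip(alpha, X, y))

    for _ in range(epochs):
        for i, (xi, yi) in enumerate(zip(X, y)):
            if yi * decision(xi) <= 0:            # misclassified training point
                alpha[i] = min(T, alpha[i] + 1.0)
    return lambda x: 1 if decision(x) > 0 else -1

def cross_val_accuracy(X, y, T, g, k=5):
    """Fraction of items predicted correctly when each one is held out
    exactly once; fold f holds out items f, f + k, f + 2k, ..."""
    n, correct = len(X), 0
    for fold in range(k):
        held_out = set(range(fold, n, k))
        X_train = [X[i] for i in range(n) if i not in held_out]
        y_train = [y[i] for i in range(n) if i not in held_out]
        predict = train_stand_in(X_train, y_train, T, g)
        correct += sum(predict(X[i]) == y[i] for i in held_out)
    return correct / n

def grid_search(X, y, T_grid, g_grid, k=5):
    """Score every (T, g) pair and return the best pair with all scores."""
    scores = {(T, g): cross_val_accuracy(X, y, T, g, k)
              for T in T_grid for g in g_grid}
    best = max(scores, key=scores.get)
    return best, scores

# Toy two-class sample (hypothetical): two Gaussian blobs in a 2D plane.
rng = random.Random(0)
X = ([(rng.gauss(-2, 0.5), rng.gauss(-2, 0.5)) for _ in range(20)] +
     [(rng.gauss(+2, 0.5), rng.gauss(+2, 0.5)) for _ in range(20)])
y = [-1] * 20 + [+1] * 20
best, scores = grid_search(X, y, T_grid=[0.1, 1.0, 10.0], g_grid=[0.01, 0.1, 1.0])
```

The structure is what matters here: the (T, g) pair is fixed first, the machine is trained on part of the data, and the held-out part provides the score used to compare pairs.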

Our goal is therefore to train a support vector based machine
to estimate the morphology of a high-redshift sample.
Throughout the paper we use the freely available package libSVM
(Chang & Lin 2001). The procedure is basically the same
as in a classical C/A classification, but uses a trained SVM to
plot the optimal boundaries. As we show below, this makes it possible to
use more than two morphological parameters simultaneously
and also to measure errors in an automated and objective way,
which is essential for comparing different classifications.
3. Test on a well-resolved nearby sample

3.1. Classical C/A classification versus 2-D SVM

In order to verify that SVM work properly when applied to the
morphological classification of galaxies, we start with a simple
test, i.e. classifying a local sample from the Sloan Digital
Sky Survey (SDSS) in the i band that has been visually classified
(Tasca & White 2005). Galaxies are nearby and consequently
well resolved and with a high S/N. Classical C/A classifications
have been proved to give good results in such cases
(e.g. Abraham et al. 1996; Menanteau et al. 2006); the
idea is therefore to verify that we get at least the same results using SVM
and that no extra biases are introduced.

We thus measure the concentration and asymmetry parameters
(see §4.3 for details). On the one hand, we try to plot the
best linear boundary by eye, as is usually done to separate galaxies
into two classes (late type and early type); on the other hand,
we train an SVM; finally, we compare the outputs. Figure 2
shows the two resulting boundaries. The shapes of the boundaries
are quite different, since SVM does not produce a linear
boundary, but when looking at the global accuracy we see
that both methods are fully consistent. Indeed, the completeness
(the fraction of visually classified galaxies that are correctly
recovered) and the contamination (the fraction of visually classified
galaxies that are misclassified) are practically the same
for the two methods (see Table 1).

Fig. 2. C/A classical classification (left) versus C/A SVM classification
(right) of 500 nearby objects. Triangles are galaxies
visually classified as late-type and circles are galaxies visually
classified as early-type. Numbers show the probability that the
predicted morphological type is the same as the visual one.

To confirm this consistency and to verify that no extra biases
are introduced, we also made a one-to-one comparison of
all the galaxies classified with the two methods. We find that
98% (94%) of the early-type (late-type) galaxies classified with
the classical C/A method are also classified as early-type (late-type)
using the trained SVM.

We conclude that for a high-S/N, well-resolved sample such
as the SDSS sample, the use of SVM to plot the boundaries is
equivalent to using classical procedures. The major advantage,
however, is that the boundary is plotted automatically to minimize
the errors.



3.2. Classical C/A classification versus 4-D SVM 

Since one of the main advantages of using SVM is that it can
work with an unlimited number of parameters, we investigate
the effect of adding dimensions to the SVM classification. We
thus classify the same sample as above but with four morphological
parameters instead of two: concentration, asymmetry,
smoothness and Gini (see §4.3 for details on how they are
calculated), and compare the outputs. Results are shown in
Table 1. We see that there is no significant gain in this particular
case. This suggests that, as shown in previous works
(e.g. Abraham et al. 1996; Menanteau et al. 2006), when dealing
with a well-resolved, high-S/N sample, concentration
and asymmetry are enough to obtain an accurate morphological
classification.
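
The accuracy figures used in this section can be computed from two label lists. The sketch below assumes one common convention: completeness of a class is the fraction of its visually classified members that are recovered, and contamination of a class is the fraction of objects assigned to it whose visual class differs. It also includes the one-to-one agreement between two methods. The labels are invented for illustration and are not the SDSS results.

```python
def completeness(visual, predicted, cls):
    """Fraction of galaxies visually classified as `cls` that the
    automated method also assigns to `cls`."""
    members = [p for v, p in zip(visual, predicted) if v == cls]
    return sum(p == cls for p in members) / len(members)

def contamination(visual, predicted, cls):
    """Fraction of galaxies assigned to `cls` whose visual class is
    different (assumed convention)."""
    assigned = [v for v, p in zip(visual, predicted) if p == cls]
    return sum(v != cls for v in assigned) / len(assigned)

def agreement(method_a, method_b, cls):
    """Fraction of galaxies put in `cls` by method A that method B
    also puts in `cls` (one-to-one comparison)."""
    in_a = [b for a, b in zip(method_a, method_b) if a == cls]
    return sum(b == cls for b in in_a) / len(in_a)

# Invented labels for eight galaxies.
visual    = ["early", "early", "early", "late", "late", "late", "late", "late"]
predicted = ["early", "early", "late",  "late", "late", "late", "early", "late"]

comp_early = completeness(visual, predicted, "early")
comp_late = completeness(visual, predicted, "late")
cont_early = contamination(visual, predicted, "early")
cont_late = contamination(visual, predicted, "late")
```

Because both quantities are computed mechanically from the labels, the same functions can be applied to the classical C/A output and to the SVM output, which is what makes the two classifications directly comparable.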

4. Going to higher redshift... 

When observing objects at higher redshift with a ground-based
telescope, the S/N decreases, and galaxies become poorly resolved
and consequently appear more symmetric and less concentrated (e.g.
Conselice et al. 2000). The separation in the C/A plane turns
out to be less clear. That is why space data such as HST imaging
are widely used for those purposes, and classifications
based on colors are usually adopted for ground-based data (e.g.
Zucca et al. 2006). It is known, however (e.g. Arnouts et al.
2007), that a classification based only on colors is highly contaminated
by the presence, for instance, of an important population
of "blue" early-type galaxies, especially at high redshift
where the red sequence is building up. That is one of the reasons
why classifications based on morphological criteria are
preferred. Indeed, with the increasing amount of data from
ground-based surveys becoming available today, it would
be interesting to know whether it is possible to obtain at least a rough
morphological classification from these observations. In the
following sections we therefore investigate whether the possibility
of using a large number of parameters and non-linear
boundaries offered by support vector machines can help to increase
the accuracy of "pure" morphological classifications on
high-redshift ground-based data.

4.1. Description of the employed method

The proposed procedure can be summarized in 4 main steps
(Figure 3):

1. Build a training set: for that purpose, we select a nearby,
visually classified sample at a wavelength corresponding to
the rest frame of the high-redshift sample to be analyzed.
We then move the sample to the proper redshift and image
quality and drop it into the high-z background. This is fully
described in Section 4.2.

2. Measure a set of morphological parameters on the sample.

3. Train a support vector based learning machine with a fraction
of the simulated sample and use the other fraction to
test and estimate errors.

4. Classify real data with the trained machine and correct for
possible systematic errors detected in the testing step.

In the following sections, we describe each of the steps enu- 
merated above. 



Fig. 3. Steps for morphological classification (see text for details).

4.2. The training set

The most important step in obtaining the morphology with
a non-parametric method is to correctly calibrate the volume
filled by the data in the multi-dimensional space. This is a critical
step since it determines the decision regions that will be
used to perform the classification. Indeed, galaxy morphology
depends on the physical properties of the galaxy (luminosity,
redshift, wavelength) and on the observing conditions (background
level, resolution). A suitable calibration set should consequently
reproduce closely all the properties of the sample to
be analyzed. One classical approach consists in visually classifying
a fraction of the sample and using it as a training set to
optimize boundaries (Menanteau et al. 1999, 2006). However,
this is not possible for seeing-limited data, where the resolution
is too low to enable a reliable visual classification. Here, we
therefore decide to simulate the high-redshift sample from a visually
classified local catalog, selected in the rest-frame color of
the high-redshift sample. This has three main advantages: first,
it is free from K-correction effects; second, it does not introduce
any modeling effect, since the galaxies used are real; and finally,
the training set is built to reproduce the observing and physical
properties of the sample to be analyzed, but it is classified
locally, so the data to be analyzed do not need a particularly high resolution.



4.2.1. Real sample

In order to test the method, we work on a sample of galaxies
observed with WIRCam at CFHT in the near-infrared
Ks band. The field is part of the Canada-France-Hawaii
Telescope Legacy Survey (CFHTLS) Deep survey and its
near-infrared follow-up, and it is centered on the COSMOS area
(Scoville & COSMOS Team 2005). We use a cutout of 1 x 1 0'
to perform all the tests. The sample is complete up to K(AB) =
22 and the median photometric redshift is ~ 0.8 (Fig. 4).
Images are reduced with the Terapix pipeline¹ and have a pixel
scale of 0.15" with a mean FWHM of 0.7". These data are particularly
interesting because K-band data have the advantage
of probing old stellar populations in the rest frame, enabling
a determination of galaxy morphological types unaffected by
recent star formation. Moreover, no space data in this wavelength
range are available today. Photometric redshifts come
from the publicly available catalogue of Ilbert et al. (2006a),
computed with the LePhare² code on the CFHTLS Deep survey
Terapix release T0003 and its multi-color photometric catalogs.

¹ http://terapix.iap.fr

Fig. 4. Magnitude and redshift distributions of the real and simulated
sample. Solid line: real sample. Dotted line: simulated
sample. Error bars show Poissonian errors for the real sample.
See text for explanations concerning the differences between
the simulated and real distributions.



4.2.2. Building the sample 

We therefore used a local catalog of 1472 objects from the
Sloan Digital Sky Survey in the i band, which roughly corresponds
to the rest frame of the K band at z ~ 1, and that has
been visually classified (Tasca & White 2005).



We first generate a random pair of (magnitude, redshift)
values with a probability distribution that matches the real magnitude
and redshift distribution of the sample to be simulated
(see Figure 4).
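
One simple way to draw such pairs is sketched below, under the assumption that the joint distribution is approximated by a 2D histogram of the real catalogue: a cell is chosen with probability proportional to its count, then a point is drawn uniformly inside it. The text does not specify how the draw is actually implemented, and the catalogue values, bin widths and function names here are hypothetical.

```python
import random
from collections import Counter

def build_histogram(pairs, mag_step, z_step):
    """Count real (magnitude, redshift) pairs on a regular 2D grid."""
    return Counter((int(m // mag_step), int(z // z_step)) for m, z in pairs)

def draw_pair(hist, mag_step, z_step, rng):
    """Pick a cell with probability proportional to its count, then a
    uniformly distributed point inside that cell."""
    cells = list(hist)
    i, j = rng.choices(cells, weights=[hist[c] for c in cells], k=1)[0]
    return (i + rng.random()) * mag_step, (j + rng.random()) * z_step

# Hypothetical real catalogue: (Ks magnitude, photometric redshift).
real_pairs = [(19.2, 0.45), (20.1, 0.7), (20.4, 0.8), (20.6, 0.85),
              (21.1, 0.8), (21.3, 0.95), (21.7, 1.1), (21.9, 1.3)]

hist = build_histogram(real_pairs, mag_step=0.5, z_step=0.25)
rng = random.Random(42)
simulated = [draw_pair(hist, 0.5, 0.25, rng) for _ in range(1000)]
```

Sampling the pair jointly, rather than magnitude and redshift separately, keeps the correlation between the two quantities that is present in the real sample.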

Then, for every galaxy stamp, we proceed in four steps:

1. First, we remove all the foreground stars and all other
sources that do not belong to the galaxy itself. For that
purpose we use the SExtractor segmentation map
(Bertin & Arnouts 1996) and replace all the surrounding
sources with random noise with the same statistics (mean
value and variance) as the real background noise. The
foreground stars that fall within the galaxy are replaced
with the mean value in the galaxy area.

2. Second, we degrade the resolution to reach that at high
redshift: we measure the FWHM at high redshift ($\beta_h$), convert
it to kpc using a standard ΛCDM cosmology and deduce
the resolution the local galaxy must have ($\beta_l$). Then
the image is convolved with a 2D Gaussian function of
FWHM $= \sqrt{\beta_l^2 - \beta_i^2}$, where $\beta_i$ is the local galaxy's initial
resolution.

3. Third, the image is binned to reach the expected angular
size at high redshift with the 0.15" pixel scale. In this step,
the image is also scaled to its new magnitude. In the scaling
procedure we force the final mean background level of the
simulated stamp to be at least 3 times lower than the real
background. This prevents the local noise from dominating
over the high-redshift noise when the galaxy is dropped into a
real background. It implies that objects that are too bright
(typically Ks < 17) cannot be simulated, since the necessary
scaling factor is too small, and explains the difference between
the real and the simulated magnitude distributions in
Figure 4. The difference at the faint end is due to the fact
that some simulated objects are not detected by SExtractor.

4. Finally, we drop the galaxy into a real background image.

Fig. 5. Example of simulation for a galaxy. 1: SDSS i band image;
2: image after subtraction of foreground stars; 3: image after
convolution; 4: image after binning; 5: final simulated field
with real and simulated galaxies.

² http://cencas.oamp.fr/cencos/CFHTLS/


Figure 5 illustrates the entire procedure for a spiral galaxy.

In summary, we simulate a high-redshift sample from a local
sample selected in the rest frame of the high-redshift sample to
avoid K-correction effects. The simulated sample reproduces the
observing conditions (background level, noise, resolution) and the
physical properties (redshift and magnitude distributions) of the
sample to be analyzed.
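
The arithmetic behind the resolution and flux steps can be sketched as follows, using the cosmology adopted in this paper. Two assumptions are made explicit: the point-spread functions are treated as Gaussians, so that widths subtract in quadrature, and the flux scaling follows the standard magnitude relation. The local redshift, the initial seeing and all function names are hypothetical.

```python
import math

H0, OMEGA_M, OMEGA_L = 70.0, 0.3, 0.7      # cosmology adopted in the paper
C_KM_S = 299792.458                         # speed of light in km/s

def angular_diameter_distance_mpc(z, steps=2000):
    """Flat LCDM angular-diameter distance (trapezoidal integration)."""
    def inv_e(zp):
        return 1.0 / math.sqrt(OMEGA_M * (1.0 + zp) ** 3 + OMEGA_L)
    h = z / steps
    integral = 0.5 * (inv_e(0.0) + inv_e(z))
    integral += sum(inv_e(i * h) for i in range(1, steps))
    comoving = C_KM_S / H0 * integral * h
    return comoving / (1.0 + z)

def kpc_per_arcsec(z):
    """Physical size subtended by one arcsecond at redshift z."""
    return angular_diameter_distance_mpc(z) * 1000.0 * math.pi / (180.0 * 3600.0)

def convolution_fwhm(fwhm_high, z_high, z_local, fwhm_local_initial):
    """FWHM (arcsec, local frame) of the Gaussian that brings the local
    image to the physical resolution of the high-redshift data."""
    target_kpc = fwhm_high * kpc_per_arcsec(z_high)
    fwhm_local_target = target_kpc / kpc_per_arcsec(z_local)
    if fwhm_local_target <= fwhm_local_initial:
        raise ValueError("local image is already coarser than the target")
    return math.sqrt(fwhm_local_target ** 2 - fwhm_local_initial ** 2)

def flux_scale(mag_local, mag_target):
    """Multiplicative factor that moves the stamp to its new magnitude."""
    return 10.0 ** (-0.4 * (mag_target - mag_local))

# Hypothetical case: 0.7" seeing at z = 0.8, local galaxy at z = 0.05
# observed with an initial seeing of 1.4".
scale_high = kpc_per_arcsec(0.8)
kernel_fwhm = convolution_fwhm(0.7, 0.8, 0.05, 1.4)
factor = flux_scale(15.0, 20.0)
```

The same physical resolution corresponds to a much larger angle for the nearby galaxy, which is why the convolution kernel is several arcseconds wide in this example.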
4.3. Measuring morphological parameters

Once the simulated galaxies are dropped in a real background, 
we measure the following 5 morphological parameters: 

- Concentration: basically, it measures the ratio of light 
within a circular or elliptical inner aperture to the light 
within a circular or elliptical outer aperture. Generally, it's 
defined in sl ightly different way by different authors. Here 
we adopt the lBershadv et al.l (120001) definition as for the ra- 
tio of the circular radii containing 20% and 80% of the "to- 
tal flux": 



C = 5log{rgo/r2Q) 



(1) 



We use Conselice's (2003) definition of the total flux as 
the flux contained within l.Srp (Petrosian radius). For the 
concentration measurement, the galaxy's center is that de- 
termined by the asymmetry minimization (see below). 
Asymmetry: it quantifies the degree to which the light of 
a galaxy is rotationally symmetric. It is measured by sub- 
tracting the galaxy image rotated by 180" from the original 
image: 



1 l i:\l(ij)-lmiij)\ 

2 1 zmj) 



i:\Bii, j)-Bm{iJ)\ 



(2) 



where I is the galaxy image and /iso is the galaxy image 
rotated by 180 about the galaxy's central pixel and B is the 
average asymmetry of the background. The central pixel is 

determined by minimizing 

- Smoothness: developed by Conselice et al. (2000), it
quantifies the degree of small-scale structure. The galaxy image
is smoothed by a boxcar of given width and then subtracted
from the original image:

$$S = \frac{1}{2}\left(\frac{\sum |I(i,j) - I_S(i,j)|}{\sum I(i,j)} - \frac{\sum |B(i,j) - B_S(i,j)|}{\sum I(i,j)}\right)$$

where $I_S$ is the galaxy image smoothed by a boxcar of width
$0.25\,r_p$.



- Moment of Light: introduced by Lotz et al. (2004), the
total second-order moment $M_{tot}$ is the flux $f_i$ in each
pixel multiplied by the squared distance to the center of the
galaxy, summed over all the galaxy pixels assigned by the
SExtractor segmentation map:
$M_{tot} = \sum_i f_i \left[(x_i - x_c)^2 + (y_i - y_c)^2\right]$,
where $(x_c, y_c)$ is the galaxy's center. The second-order
moment of the brightest regions of the galaxy traces the
spatial distribution of any bright nuclei, bars, spiral arms
and off-center star clusters. We define $M_{20}$ as the
normalized second-order moment of the 20% brightest pixels of
the galaxy.

- Gini Coefficient: it is a statistic based on the Lorenz curve,
i.e. the rank-ordered cumulative distribution function of a
population's wealth or, in this case, a galaxy's pixel values
(Abraham et al. 2003). For the majority of local galaxies,
the Gini coefficient is correlated with the concentration
index and increases with the fraction of light in a central
component. However, unlike C, G is independent of the
large-scale spatial distribution of the galaxy's light. Therefore,
G differs from C in that it can distinguish between galaxies with
shallow light profiles (which have both low C and G) and
galaxies where much of the flux is located in a few pixels
but not at the center (which have low C but high G).
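
The definitions above can be illustrated with a minimal NumPy sketch. It is not the authors' code: the function names are hypothetical, the total flux is taken over the whole stamp instead of within $1.5\,r_p$, the rotation is about the stamp center with no asymmetry minimization, the Gini estimator is the usual rank-ordered form, and $M_{20}$ follows the common convention of summing the brightest pixels until they hold 20% of the flux, which is an assumption about the wording in the text.

```python
import numpy as np

def concentration(img, center):
    """C = 5 log10(r80 / r20) with circular radii around center = (xc, yc)."""
    y, x = np.indices(img.shape)
    r = np.hypot(x - center[0], y - center[1]).ravel()
    order = np.argsort(r)                       # pixels sorted by radius
    growth = np.cumsum(img.ravel()[order])      # curve of growth
    # Assumption: "total flux" is the whole stamp, not the flux within 1.5 r_p.
    r20 = r[order][np.searchsorted(growth, 0.2 * growth[-1])]
    r80 = r[order][np.searchsorted(growth, 0.8 * growth[-1])]
    return 5.0 * np.log10(r80 / r20)

def asymmetry(img, bkg):
    """Eq. (2), rotating about the stamp center (no center minimization)."""
    gal_term = np.abs(img - np.rot90(img, 2)).sum()
    bkg_term = np.abs(bkg - np.rot90(bkg, 2)).sum()
    return 0.5 * (gal_term - bkg_term) / img.sum()

def gini(pixels):
    """Rank-ordered Gini estimator of the pixel flux distribution."""
    x = np.sort(np.abs(np.ravel(pixels)))
    n = x.size
    i = np.arange(1, n + 1)
    return np.sum((2 * i - n - 1) * x) / (x.mean() * n * (n - 1))

def m20(img, center):
    """log10 of the moment of the brightest pixels holding 20% of the flux,
    normalized by M_tot (assumed convention)."""
    y, x = np.indices(img.shape)
    f = img.ravel()
    m = f * ((x.ravel() - center[0]) ** 2 + (y.ravel() - center[1]) ** 2)
    order = np.argsort(f)[::-1]                 # brightest pixels first
    cum = np.cumsum(f[order])
    n_bright = np.searchsorted(cum, 0.2 * cum[-1]) + 1
    return np.log10(m[order][:n_bright].sum() / m.sum())

# Toy example: a round Gaussian "galaxy" on a noiseless, empty background.
yy, xx = np.mgrid[-20:21, -20:21]
galaxy = np.exp(-(xx ** 2 + yy ** 2) / (2 * 3.0 ** 2))
center = (20, 20)
print(concentration(galaxy, center), asymmetry(galaxy, np.zeros_like(galaxy)),
      gini(galaxy), m20(galaxy, center))
```

For a perfectly symmetric stamp the asymmetry vanishes, and the Gini estimator goes from 0 for uniform pixel values to 1 when all the flux sits in a single pixel.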

Each of the above parameters measures a different property
of a galaxy and therefore gives a different amount of
information concerning the galaxy's morphological type. For
instance, Lotz et al. (2004) used the $M_{20}$/Gini plane to
identify merger candidates, whereas the C/A plane is classically
used to separate late from early type galaxies. A
multi-dimensional analysis consequently makes it possible to use
simultaneously all the information brought by each of the
morphological parameters to increase the accuracy of the
classification. Moreover, previous works have shown that the
measured parameters might also depend on the size, the
luminosity or the redshift of the galaxy
(Brinchmann et al. 1998; Bershady et al. 2000). Therefore,
including non-morphological parameters should help the
machine to take into account systematic trends in the
morphological parameters due to luminosity or size variations.
We thus measure 7 more parameters that we distribute in 4 classes:
shape, size, luminosity and distance, according to the kind of
information they measure:

- Shape: we include 2 shape parameters: the galaxy ellipticity
as measured by SExtractor, which is derived from the ratio of
the minor to major axes of the isophotal ellipse describing the
galaxy, and the CLASS_STAR parameter, also from SExtractor.
This parameter is intended to separate galaxies from stars
and results from a neural network classification. Since it
spans a continuous range between 0 and 1, it can be
interpreted as a measure of the galaxy's compactness.

- Size: the size parameters include the isophotal galaxy area
and the Petrosian radius.

- Luminosity: we use the apparent magnitude of the galaxy 
and the mean surface brightness. 

- Distance: we adopt the photometric redshift as a measure 
of the distance. 

4.4. Training and testing 

We perform several tests to probe the accuracy of the proposed
method. For all the tests we adopt the same procedure: we use
a fraction of the simulated catalogue (typically 500 galaxies) to
train the machine and the remaining 1000 objects to test it by
looking at the fraction of galaxies that are correctly classified.
We limit the analysis to only 2 broad morphological classes
(late type and early type). The main reason for this choice is
that there are too few irregular galaxies in the employed local
sample to define a class. There is however no loss of
generality: the same analysis can be performed with an arbitrary
number of classes, provided of course that they are correlated
with the measured parameters.
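
The train/test procedure can be sketched as follows. This is only an illustration under stated assumptions: it uses scikit-learn's `SVC` as a stand-in SVM implementation, an RBF kernel with default hyperparameters, and a synthetic 12-parameter catalogue from `make_classification` in place of the simulated galaxy sample. None of these choices is taken from the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the simulated catalogue: 1500 objects, 12 parameters,
# 2 classes (0 = early type, 1 = late type). This is not real galaxy data.
X, y = make_classification(n_samples=1500, n_features=12, n_informative=5,
                           n_redundant=2, random_state=0)

# 500 objects to train the machine, the remaining 1000 to test it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=500, random_state=0, stratify=y)

# Rescale each parameter, then fit an SVM with a non-linear (RBF) kernel.
# Kernel and hyperparameters are illustrative assumptions.
machine = make_pipeline(StandardScaler(),
                        SVC(kernel="rbf", C=1.0, gamma="scale"))
machine.fit(X_train, y_train)

accuracy = machine.score(X_test, y_test)   # fraction correctly classified
print(f"global accuracy on {len(y_test)} test objects: {accuracy:.2f}")
```

Rescaling the parameters before training is a design choice of this sketch: quantities such as area, magnitude and redshift span very different numerical ranges, and an RBF kernel is sensitive to that.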

4.4.1. Classical C/A classification versus 2-D SVM

The first question we try to answer is how good the
classification of this sample would be using a classical linear
C/A classification. We thus take the brightest objects of the
sample (with known visual morphology) ($K_s < 20$) and try to
draw a linear boundary between the two distributions. As
expected, the distributions are now poorly separated and drawing
a linear boundary becomes extremely difficult. This is confirmed
when trying to classify the whole sample (Table 2): the
completeness and contamination are basically the same as those
we would have obtained with a random choice. We conclude that
concentration and asymmetry alone cannot be used on this sample
to obtain a reliable morphological classification.

In a second step, we classify this sample with an SVM
using the same two parameters. Results are shown in
Table 2. We observe a slight gain due to the fact that the SVM can
adapt boundaries in a non-linear way, but the accuracy is still
comparable with a random choice.

4.4.2. n-D versus 2-D SVM 

Global effect. We then trained 2 machines: the first one with
only 2 parameters (C and A), which should globally give the
same results as a classical C/A classification as shown in
Section 3, and the second one with the 12 parameters described
above. We then tested both machines by looking at the fraction of
galaxies that are correctly classified. Results for the whole
sample are summarized in Table 2. We observe that including more
than two parameters in the classification results in a significant
gain for this sample, where C/A cannot do much better than a
random choice. Indeed we almost recover the same accuracy
that was obtained on the nearby sample (Table 1).

Robustness. We now try to establish the robustness of this
effect. For that purpose, we look at the accuracy of the
classification as a function of 3 main properties of the
galaxies: luminosity, distance and area (Fig. 6), by
progressively adding objects and measuring each time: a) the
global accuracy, i.e. the fraction of galaxies that are
classified correctly by the machine, and b) the accuracy per
morphological type, i.e. the fraction of predicted early (late)
type galaxies that are visually classified as early (late) type
($N_{e \to e}$ and $N_{s \to s}$, respectively).
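
As a worked example of these two definitions, the short sketch below computes them from the four counts of a confusion matrix, using the classical C/A counts listed in Table 1. The function name and the dictionary keys are hypothetical.

```python
def accuracy_metrics(ee, es, se, ss):
    """Counts are labelled (visual class, predicted class):
    e = early type, s = late type. For example, `se` is the number of
    visually late-type galaxies predicted as early type."""
    total = ee + es + se + ss
    return {
        # a) fraction of all galaxies classified correctly
        "global": (ee + ss) / total,
        # b) fraction of predicted early (late) types that are visually
        #    early (late) types
        "N_e->e": ee / (ee + se),
        "N_s->s": ss / (es + ss),
    }

# Classical C/A counts of Table 1 (SDSS sample).
metrics = accuracy_metrics(ee=254, es=17, se=65, ss=172)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these counts the per-type values reproduce the 0.80 and 0.91 fractions quoted in Table 1, which confirms that the tabulated fractions are normalized per predicted class.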

Several conclusions can be extracted from this comparison: 

- First, using more than two parameters simultaneously
clearly increases the global accuracy of the classification
in all the redshift, area or luminosity ranges. Indeed, the
mean fraction of correct classifications in the 2-dimensional
machine is around 60% and decreases to ~ 50%,
which means that there is a high contamination in the C/A
plane, whereas it rises to more than 80% when using
a 12-dimensional machine, which is comparable to what
is obtained in space observations (Brinchmann et al. 1998;
Menanteau et al. 2006).

- Second, the gain is even higher when looking at the
$N_{e \to e}$ and $N_{s \to s}$ coefficients. For the C/A
classification, there is indeed an asymmetric response of the
machine: early type galaxies are better identified (65%)
whereas the fraction of late type is significantly lower (50%),
which means that an important fraction of late type galaxies
are classified as early type. This will result, when doing the
classification, in an important bias towards a too high
fraction of elliptical galaxies. However, in the 12 parameter
classification, the accuracies are almost perfectly symmetric
for the two morphological types.

- Third, when looking at the evolution as a function of
distance, size and luminosity, the 12-dimensional machine
results in a more stable response, in particular as a function
of magnitude and redshift.

4.4.3. How to fix the number of parameters?



In Section 4.4.2 it is shown that the use of more than 2
dimensions to obtain morphology clearly increases the accuracy of
the global classification. However, the questions that arise are:
are all these parameters necessary? Might some parameters
introduce a degeneracy and consequently reduce the machine's
accuracy?

To try to answer these questions we perform a single test that
consists in training several machines with an increasing
number of parameters. We thus start with a classical 2 parameter
machine (C and A) and we progressively add dimensions
until we reach the 12 dimensions described in Section 4.3.
Results are plotted in Figure 7. As above, we plot the global
accuracy and the accuracy per morphological type. The dimensions
are separated into 5 categories (morphology, shape, size,
luminosity and distance).

Two important points arise at first sight. First, not all the
parameters bring the same amount of useful information: the
morphology and shape parameters carry practically all the
information needed to reach 80% accuracy. Second, the
accuracy is a monotonic function of the number of parameters:
adding a parameter can leave the accuracy almost unchanged
(for instance, the magnitude) but never reduces it. This is
particularly important, since it means that including more
parameters than necessary does not result in a degeneracy. In
addition, adding parameters does not result in a significant
increase of the computing time.
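
The incremental test described above can be sketched as a loop that retrains the machine on the first 2, 3, ..., 12 columns of the parameter table. As before, this is a hedged illustration: scikit-learn's `SVC` and the synthetic catalogue are assumptions, and on such toy data the accuracy is not guaranteed to be monotonic as reported in the text for the galaxy sample.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic 12-parameter catalogue (not real data). With shuffle=False the
# informative columns come first, loosely mimicking "morphology first".
X, y = make_classification(n_samples=1500, n_features=12, n_informative=5,
                           n_redundant=2, shuffle=False, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=500, random_state=1, stratify=y)

accuracies = []
for n_params in range(2, 13):            # 2-D machine up to 12-D machine
    machine = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    machine.fit(X_train[:, :n_params], y_train)
    accuracies.append(machine.score(X_test[:, :n_params], y_test))

for n_params, acc in zip(range(2, 13), accuracies):
    print(f"{n_params:2d} parameters: accuracy = {acc:.3f}")
```

Each machine is trained from scratch on its own subset of columns, so the cost of the whole scan is simply the sum of the individual training times.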



4.4.4. Influence of the training set 

The method we adopted here for building the training set aims
at reproducing the observing conditions and physical
properties of the sample in order to reduce errors due to the
difference between the training and the science samples. The
machine is thus trained to solve a specific problem and should be
trained differently for every science sample. We now measure the
importance of this effect by simulating the same sample as if it
were observed by the adaptive optics system NACO installed
on the VLT. We use NACO data that have been observed in
the Ks band with 2 to 3 hours exposure time for each
pointing (Huertas-Company et al. 2007). The total area covered by
these data reaches 7 arcmin$^2$ and the mean resolution is 0.1".
We therefore repeated the same procedure but dropped the
simulated catalogue in a real NACO background. We then trained
the machine with this sample and tried to classify the WIRCam
simulated sample with the trained machine.

Fig. 6. Cumulative accuracy of classifications for a 2D machine (left column) and a 12D one (right column) as a function of
magnitude (a and b), area (c and d) and redshift (e and f). Solid line shows the global accuracy, i.e. the number of galaxies 
correctly identified, dotted and dashed lines show respectively the fraction of early type and late type galaxies classified correctly. 
Stamps in the right column show a typical galaxy for every magnitude, area and redshift range. 



Results are shown in Table 3: the global accuracy of the
classification falls to 62%, i.e. about 40% contamination, when
using the NACO model to classify WIRCam galaxies. In
particular, there is a systematic drift from late to early type
galaxies. The training set must thus be carefully built to take
into account all the observing conditions.



5. Summary and conclusions 

We have presented a new method to perform morphological
classification of cosmological samples based on support
vector machines. It can be seen as a generalization of the
classical non-parametric C/A classification method but with
an unlimited number of dimensions and non-linear boundaries
between the decision regions. The method is especially well
suited to large cosmological surveys since it is fully
automated and errors are estimated objectively, allowing an
easy comparison between surveys with different properties.
Furthermore, since the calibration sample is built from a
visually classified nearby sample adapted to reproduce the
physical and instrumental properties of the sample to be
analyzed, the method can even be employed on seeing-limited
data. Selecting the calibration sample in the high redshift
sample's rest-frame makes the results robust against wavelength
dependent effects and makes it easier to interpret them in
terms of evolution.

[Figure 7. Curves: global correct, global errors, late type correct, late type errors, early type correct, early type errors. x-axis (Parameters): C/A, Gini, Smooth., M20, Ellip., ST_CL, Area, Petro. rad., Mag., M.S.B., z.]

Fig. 7. Accuracy of the classification as a function of the
number of parameters. The first point corresponds to a classical
C/A classification and each new point adds a dimension.
Parameters are classified in 5 classes: morphology, shape, size,
luminosity, and distance.

                 WIRCam Model    NACO Model
Global           ...             0.62
$N_{e \to e}$    ...             0.96
$N_{s \to s}$    ...             0.24

Table 3. Accuracy of the classification when using a machine
trained with a sample with different properties than the science
sample - see text for details.

As a test, we use our method to classify a near-infrared
seeing-limited sample observed with WIRCam at CFHT with
a training set of ~ 1500 objects from the SDSS. We show that
increasing the number of parameters in the analysis reduces
errors by more than a factor of 2, leading to a mean accuracy of
~ 80% of correct classifications up to the sample completeness
limit ($K_{AB}$ ~ 22). The accuracy is furthermore a monotonic
function of the number of parameters.

The presented method is intended as a framework for
future studies. In particular, it can be used to look for
luminosity and color evolution as a function of morphology.
However, this method is far more general and can be
applied to many other samples of galaxies observed from the
ground, with or without AO correction. Several
applications are planned in order to study the effects
of local environment and galaxy density on the morphological
evolution of galaxies both in the field and in rich clusters
of galaxies. The library is available for download at
http://www.lesia.obspm.fr/~huertas/galsvm.html.

                     Classical C/A             SVM C/A                   SVM 4-D
                     Early-Type   Late-Type    Early-Type   Late-Type    Early-Type   Late-Type
Visual Early-Type    0.80 (254)   0.09 (17)    0.79 (256)   0.08 (15)    0.79 (251)   0.10 (20)
Visual Late-Type     0.20 (65)    0.91 (172)   0.21 (72)    0.92 (166)   0.21 (67)    0.90 (171)

Table 1. Comparison of the accuracy of three classifications of the SDSS sample: Classical C/A, SVM C/A and 4-D SVM. The
table shows for each method the relations between the visual and the predicted morphological classes. The numbers of objects are
given in parentheses (see text for details).





                     Classical C/A             SVM C/A                   SVM 12-D
                     Early-Type   Late-Type    Early-Type   Late-Type    Early-Type   Late-Type
Visual Early-Type    0.59 (96)    0.51 (321)   0.57 (304)   0.45 (113)   0.75 (365)   0.18 (52)
Visual Late-Type     0.41 (65)    0.49 (309)   0.43 (236)   0.55 (138)   0.25 (149)   0.82 (225)

Table 2. Comparison of the accuracy of three classifications of the WIRCam sample: Classical C/A, SVM C/A and 12-D SVM.
The table shows for each method the relations between the visual and the predicted morphological classes. The numbers of objects
are given in parentheses (see text for details).



